
PARQUET-2347: Add interface layer between Parquet and Hadoop Configuration #1141

Merged

Conversation

@amousavigourabi (Contributor)

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    Additional parameters to run the read/write tests using the new interface-using methods.

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All public functions and classes in the PR contain Javadoc explaining what they do

japicmp exclusions have been added for the following classes: org.apache.parquet.hadoop.CodecFactory, org.apache.parquet.hadoop.ParquetReader. When these exclusions are removed, the following incompatibilities are detected:

  • org.apache.parquet.hadoop.CodecFactory.configuration: FIELD_TYPE_CHANGED
  • org.apache.parquet.hadoop.ParquetReader$Builder.conf: FIELD_TYPE_CHANGED

This PR is part of an effort that has been discussed on the dev mailing list.

@amousavigourabi force-pushed the add-interface-layer-for-configuration branch from aeb2348 to a2ce8af on September 17, 2023 21:54
Add interface layer for the Hadoop Configuration.
@amousavigourabi force-pushed the add-interface-layer-for-configuration branch from a2ce8af to 7db4478 on September 19, 2023 16:35
@wgtmac (Member) left a comment:

Thanks for doing this! I have left some inline comments.

BTW, I am also curious about the following things.

  • It seems that we still have some compatibility issues. Can you confirm? If yes, could you please write them out explicitly?
  • Is there any follow-up work item to do? Would be good if we can know the whole picture in advance.
  • Is it possible to add a simple test case to prove that a simple writer and reader roundtrip can work successfully without Hadoop dependency?

if (conf instanceof HadoopParquetConfiguration) {
return ((HadoopParquetConfiguration) conf).getConfiguration();
}
Configuration configuration = new Configuration();
Member:

When will it happen?

Contributor Author:

When using a HadoopParquetConfiguration, the user has not yet decoupled from Hadoop, as it is just a wrapper around Configuration. When users want to decouple from Hadoop, they can implement their own ParquetConfiguration that does not rely on Hadoop's Configuration (or a simple implementation can be added later; this PR was already getting rather large). Some code, mainly around the codecs, still needs a Hadoop Configuration. While we are still removing these last references to Hadoop, it is therefore important that we can obtain such an instance from a ParquetConfiguration, so that nothing breaks.
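The unwrap-or-copy pattern the snippet above belongs to can be sketched as follows. This is a simplified, self-contained sketch: ParquetConfiguration and HadoopParquetConfiguration are reduced stand-ins, Configuration here is a plain string map rather than org.apache.hadoop.conf.Configuration, and the helper name toHadoop is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for the PR's types, reduced to what this sketch needs.
interface ParquetConfiguration {
    Iterable<Map.Entry<String, String>> entries();
}

class Configuration {
    final Map<String, String> props = new HashMap<>();

    void set(String name, String value) { props.put(name, value); }

    String get(String name) { return props.get(name); }
}

class HadoopParquetConfiguration implements ParquetConfiguration {
    private final Configuration wrapped;

    HadoopParquetConfiguration(Configuration conf) { this.wrapped = conf; }

    Configuration getConfiguration() { return wrapped; }

    @Override
    public Iterable<Map.Entry<String, String>> entries() {
        return wrapped.props.entrySet();
    }
}

public class ConfigurationUtilSketch {

    // Unwrap-or-copy: a HadoopParquetConfiguration hands back the exact
    // Configuration it wraps; any other implementation is copied entry by
    // entry into a fresh Configuration.
    static Configuration toHadoop(ParquetConfiguration conf) {
        if (conf instanceof HadoopParquetConfiguration) {
            return ((HadoopParquetConfiguration) conf).getConfiguration();
        }
        Configuration configuration = new Configuration();
        for (Map.Entry<String, String> entry : conf.entries()) {
            configuration.set(entry.getKey(), entry.getValue());
        }
        return configuration;
    }

    public static void main(String[] args) {
        Configuration hadoopConf = new Configuration();
        hadoopConf.set("parquet.page.size", "1048576");

        // Unwrapping returns the very same underlying object.
        ParquetConfiguration wrapped = new HadoopParquetConfiguration(hadoopConf);
        if (toHadoop(wrapped) != hadoopConf) throw new AssertionError("expected unwrap");

        System.out.println(toHadoop(wrapped).get("parquet.page.size"));
    }
}
```

The identity check in main is the point of the fast path: while codec code still needs a real Hadoop Configuration, unwrapping avoids copying and keeps any later mutations visible on both sides.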

pom.xml (outdated, resolved)
@amousavigourabi (Contributor Author)

@wgtmac thanks a lot for the review! It's quite a big one, so I really appreciate that you took the time for it.

To address your concerns:

  • It seems that we still have some compatibility issues. Can you confirm? If yes, could you please write them out explicitly?

Correct, japicmp comes up with two incompatibilities. These are in the classes CodecFactory and ParquetReader. The incompatibilities it points to are both changes of private and protected field types from Configuration to ParquetConfiguration. These changes are strictly necessary for the effort to unhadoop the read/write API.

  • Is there any follow-up work item to do? Would be good if we can know the whole picture in advance.

After this, the following steps will need to be taken: 1. creating a simple unhadooped implementation of the ParquetConfiguration interface, and 2. adding a simple way for users to avoid the Hadoop codecs, as the out-of-the-box implementations still rely very heavily on Hadoop classes. These changes should allow users to drop the Hadoop runtime dependency. The Hadoop client API dependency will still be necessary.
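The first follow-up item, a Hadoop-free configuration, could look roughly like this. The class name PlainParquetConfiguration and the reduced four-method interface are hypothetical; the real org.apache.parquet.conf.ParquetConfiguration interface carries more accessors.

```java
import java.util.HashMap;
import java.util.Map;

// Reduced, hypothetical subset of the ParquetConfiguration interface.
interface ParquetConfiguration {
    void set(String name, String value);
    String get(String name);
    int getInt(String name, int defaultValue);
    boolean getBoolean(String name, boolean defaultValue);
}

// Hypothetical Hadoop-free implementation backed by a plain HashMap.
public class PlainParquetConfiguration implements ParquetConfiguration {
    private final Map<String, String> props = new HashMap<>();

    @Override
    public void set(String name, String value) { props.put(name, value); }

    @Override
    public String get(String name) { return props.get(name); }

    @Override
    public int getInt(String name, int defaultValue) {
        String value = props.get(name);
        return value == null ? defaultValue : Integer.parseInt(value);
    }

    @Override
    public boolean getBoolean(String name, boolean defaultValue) {
        String value = props.get(name);
        return value == null ? defaultValue : Boolean.parseBoolean(value);
    }

    public static void main(String[] args) {
        ParquetConfiguration conf = new PlainParquetConfiguration();
        conf.set("parquet.page.size", "1048576");
        System.out.println(conf.getInt("parquet.page.size", 0));          // prints 1048576
        System.out.println(conf.getBoolean("parquet.validation", false)); // prints false
    }
}
```

Because the interface mirrors Hadoop's Configuration accessor names, existing call sites written against ParquetConfiguration would not notice whether a Hadoop-backed or a map-backed implementation is plugged in.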

  • Is it possible to add a simple test case to prove that a simple writer and reader roundtrip can work successfully without Hadoop dependency?

We do not yet have any serious ways for users to use the API without Hadoop dependencies. The parameters added to the TestReadWrite fixture verify that the read/write API still functions when using the ParquetConfiguration interface.

@wgtmac (Member) left a comment:

Thanks for the detailed explanation! This LGTM. You could probably also create JIRA issues in advance for all planned work items.

I would like to hear feedback from more maintainers before proceeding. @gszadovszky @shangxinli @ggershinsky @Fokko

@@ -547,6 +547,8 @@
</excludeModules>
<excludes>
<exclude>${shade.prefix}</exclude>
<exclude>org.apache.parquet.hadoop.CodecFactory</exclude> <!-- change field type from Configuration to ParquetConfiguration -->
Member:

Just curious: can we remove this in a future PR, given that only this PR introduces the breaking change here?

Contributor Author:

I'm not that well versed in how japicmp works exactly, but I reckon this can be removed after the next minor release.

@wgtmac (Member)

wgtmac commented Oct 12, 2023

@shangxinli @gszadovszky Do you have any comment? If not, I plan to merge this next week. Thanks!

@amousavigourabi Could you please resolve the conflicts? Thank you!

@gszadovszky (Contributor)

@wgtmac, @amousavigourabi, sorry, but I won't have the time to properly review this PR in the upcoming days.
The concept looks good to me so I am giving a +0 for this patch.

*
* @deprecated override {@link ReadSupport#init(InitContext)} instead
*/
@Deprecated
Contributor:

Do we still need to add a deprecated method?

Contributor Author:

This PR is focused on transitioning from Configuration to the ParquetConfiguration interface. That included some calls to deprecated methods that I could not quickly transition away from; I would consider removing them out of scope for this PR.

Map<String, String> keyValueMetaData,
MessageType fileSchema,
ReadContext readContext) {
throw new UnsupportedOperationException("Override prepareForRead(ParquetConfiguration, Map<String, String>, MessageType, ReadContext)");
Contributor:

The error message should be more meaningful.

Contributor Author:

I followed the example set by ReadSupport#init(Configuration, Map, MessageType). As this error will not occur unless you are implementing your own ReadSupport class, I am not sure there needs to be much more information in the exception. I'll add a reference to the ReadSupport class, though.
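The pattern under discussion, a default implementation that throws until an implementor overrides it, can be sketched with simplified stand-in types (the real prepareForRead also takes the key-value metadata, file schema, and ReadContext, and ReadSupportSketch is a hypothetical name):

```java
// Simplified stand-in for the ReadSupport base class pattern.
interface ParquetConfiguration {
    String get(String name);
}

abstract class ReadSupportSketch<T> {
    // The default throws, mirroring ReadSupport#init(Configuration, Map,
    // MessageType); the message names the overload to override.
    public T prepareForRead(ParquetConfiguration configuration) {
        throw new UnsupportedOperationException(
            "Override ReadSupport.prepareForRead(ParquetConfiguration, "
                + "Map<String, String>, MessageType, ReadContext)");
    }
}

public class ReadSupportDemo {
    public static void main(String[] args) {
        // Without an override, the default signals the missing implementation.
        try {
            new ReadSupportSketch<String>() {}.prepareForRead(name -> null);
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }

        // With an override, the new-style entry point is used directly.
        ReadSupportSketch<String> overridden = new ReadSupportSketch<String>() {
            @Override
            public String prepareForRead(ParquetConfiguration configuration) {
                return configuration.get("parquet.read.schema");
            }
        };
        System.out.println(overridden.prepareForRead(name -> "message Example {}"));
    }
}
```

Since the exception is only reachable from custom ReadSupport subclasses, naming the class and overload in the message is arguably enough context for the implementor.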

* @return the information needed to write the file
*/
public WriteContext init(ParquetConfiguration configuration) {
throw new UnsupportedOperationException("Override init(ParquetConfiguration)");
Contributor:

Same here

@ConeyLiu (Contributor)

@amousavigourabi we have added so many public methods to keep backward compatibility. So which one is preferred (I think it should be the ParquetConfiguration)? Should we deprecate the old ones?

@amousavigourabi (Contributor Author)

@amousavigourabi we have added so many public methods to keep backward compatibility. So which one is preferred (I think it should be the ParquetConfiguration)? Should we deprecate the old ones?

The ParquetConfiguration methods are indeed preferred, and I would be in favour of deprecating the Hadoop Configuration methods. However, deprecating them might warrant a separate discussion on the mailing list, as their deprecation and subsequent removal would go further than what we have previously discussed: it would be a step toward eliminating the need for hadoop-client-api altogether, instead of just hadoop-client-runtime. It does not necessarily need to be part of this PR.

@wgtmac (Member)

wgtmac commented Oct 14, 2023

Thank you for the review @ConeyLiu. This PR is one of the efforts to make the parquet library usable without the Hadoop dependency. At the moment, we do not want to deprecate the Hadoop dependency yet.

@ConeyLiu (Contributor) left a comment:

Thanks, @amousavigourabi for the response and your efforts. LGTM.

This PR is one of the efforts to make the parquet library usable without the Hadoop dependency. At the moment, we do not want to deprecate the Hadoop dependency yet.

@wgtmac This is a good start. It would be nice to have a best-practices guide to help users understand the different configuration usages.

@wgtmac (Member)

wgtmac commented Oct 15, 2023

I totally agree with you, @ConeyLiu. We might need to consolidate the different configuration docs, like https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md, into a better space.

* @param name the property to set
* @param value the value to set the property to
*/
void set(String name, String value);
Contributor:

setString()?

Contributor Author:

I just went with the way Hadoop's Configuration does it, for this and the other methods in the interface, in order not to have to change references to conf.set() everywhere.

@wgtmac (Member)

wgtmac commented Oct 26, 2023

I will merge this if there is no more comment by the end of this week. Thanks @amousavigourabi for working on this!

@wgtmac merged commit 1b08469 into apache:master on Oct 30, 2023
9 checks passed
@ConeyLiu (Contributor)

This broke the CI checking.

private static final Map<Class<?>, Constructor<?>> CONSTRUCTOR_CACHE = new ConcurrentHashMap<Class<?>, Constructor<?>>();

@SuppressWarnings("unchecked")
public static <T> T newInstance(Class<T> theClass) {
Contributor:

Not sure why this breaks the API checking. However, this class seems redundant, since it has no usages.

Member:

Hmm, this is strange. Should we exclude this class in the japicmp config? @amousavigourabi

Contributor Author:

As there is no current usage, we can get away with deleting this class for now. My main concern with the CI failure is the incompatibility japicmp complains about at org.apache.parquet.conf.ParquetConfiguration.getClass(java.lang.String,java.lang.Class,java.lang.Class): CLASS_GENERIC_TEMPLATE_CHANGED. Especially as this did not come up before and the interface is completely new, I have no idea why this is happening. Neither of the matters japicmp flags should be an issue AFAIK, so that might be something to investigate in our configuration?

Contributor:

I think this must be a bug in japicmp, because both of the methods are generic template methods. It should be safe to suppress the checking.

Contributor Author:

I'll open a new PR shortly, deleting the unused stuff and suppressing the warnings.
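Such a suppression could reuse the japicmp excludes block already present in the project's pom (a sketch; the exact scope of the excludes and whether to exclude whole classes or single members is a judgment call):

```xml
<excludes>
  <!-- japicmp 0.18.x false positive: brand-new generic methods are
       flagged as CLASS_GENERIC_TEMPLATE_CHANGED -->
  <exclude>org.apache.parquet.conf.ParquetConfiguration</exclude>
  <exclude>org.apache.parquet.hadoop.util.SerializationUtil</exclude>
</excludes>
```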

Contributor Author:

It has also started failing on other completely new methods in the Parquet Hadoop module:

  • org.apache.parquet.conf.HadoopParquetConfiguration.getClass(java.lang.String,java.lang.Class,java.lang.Class): CLASS_GENERIC_TEMPLATE_CHANGED
  • org.apache.parquet.hadoop.util.SerializationUtil.readObjectFromConfAsBase64(java.lang.String,org.apache.parquet.conf.ParquetConfiguration): CLASS_GENERIC_TEMPLATE_CHANGED

Might be something to report to the japicmp maintainer?

Contributor:

I tested with 0.16.0 and there was no problem. A little strange.

Member:

Probably the current 0.18.1 is more aggressive than 0.16.0.

@amousavigourabi deleted the add-interface-layer-for-configuration branch on November 23, 2023 14:05
@@ -48,7 +51,7 @@ public class CodecFactory implements CompressionCodecFactory {
private final Map<CompressionCodecName, BytesCompressor> compressors = new HashMap<>();
private final Map<CompressionCodecName, BytesDecompressor> decompressors = new HashMap<>();

protected final Configuration configuration;
Contributor:

This causes the Iceberg build to break when bumping to 1.14.0-SNAPSHOT:

> Task :iceberg-parquet:compileJava FAILED
/Users/fokkodriesprong/Desktop/iceberg/parquet/src/main/java/org/apache/iceberg/parquet/ParquetCodecFactory.java:60: error: cannot find symbol
        codecClass = configuration.getClassLoader().loadClass(codecClassName);
                                  ^
  symbol:   method getClassLoader()
  location: variable configuration of type ParquetConfiguration
/Users/fokkodriesprong/Desktop/iceberg/parquet/src/main/java/org/apache/iceberg/parquet/ParquetCodecFactory.java:62: error: incompatible types: ParquetConfiguration cannot be converted to Configuration
      codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, configuration);

Member:

Does it make sense to fix this on the Iceberg side?

Member:

Anyway, this is a breaking change. We have to fix it before the release.

Contributor:

For Iceberg this is not an issue, since we would remove this class with the 1.14 release (that contains #1134):
https://github.com/apache/iceberg/blob/866021d7d34f274349ce7de1f29d113395e7f28c/parquet/src/main/java/org/apache/iceberg/parquet/ParquetCodecFactory.java#L28-L32

In Iceberg we shade Parquet, so this will not interfere with other libraries that use a different version of Parquet. I'm curious whether we also have this issue on the Spark side (cc @vinooganesh).
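For downstream code that cannot simply delete the affected class, one possible adaptation is to unwrap the Hadoop-backed configuration, mirroring the instanceof pattern used inside this PR. All types below are simplified stand-ins, and the helper name classLoaderOf is hypothetical:

```java
// Simplified stand-ins for the two configuration types involved.
class Configuration {
    ClassLoader getClassLoader() { return getClass().getClassLoader(); }
}

interface ParquetConfiguration {}

class HadoopParquetConfiguration implements ParquetConfiguration {
    private final Configuration wrapped;

    HadoopParquetConfiguration(Configuration conf) { this.wrapped = conf; }

    Configuration getConfiguration() { return wrapped; }
}

public class DownstreamAdaptSketch {

    // A downstream codec factory that used to read the protected
    // Configuration field can unwrap the Hadoop-backed implementation
    // instead of assuming the old field type.
    static ClassLoader classLoaderOf(ParquetConfiguration conf) {
        if (conf instanceof HadoopParquetConfiguration) {
            return ((HadoopParquetConfiguration) conf).getConfiguration().getClassLoader();
        }
        // Hadoop-free implementations fall back to the context class loader.
        return Thread.currentThread().getContextClassLoader();
    }

    public static void main(String[] args) {
        ParquetConfiguration conf = new HadoopParquetConfiguration(new Configuration());
        System.out.println(classLoaderOf(conf) != null); // prints true
    }
}
```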
